mapping because the aligner should also be able to detect the splice junctions. Alignment
is complicated by mismatches, which may arise from base-calling errors or from genetic
variation in the individual genome. Aligners therefore need a strategy that supports both
exact and inexact search, so that reads containing mismatches can still be located. Almost
all read aligners work in two major steps: indexing the reference genome sequence and then
finding the most likely locations of the reads in that reference.
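
The two steps can be illustrated with a toy seed-and-verify mapper. This is a minimal sketch, not how any particular aligner is implemented: it assumes a hash table of k-mers as the index, substitutions only (no indels or splice junctions), and the function names `build_kmer_index` and `map_read` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Indexing step: map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches=2):
    """Lookup step: exact seed hits propose loci, then each locus is verified."""
    candidates = set()
    # Non-overlapping seeds: a read with m mismatches keeps at least one seed
    # intact whenever it contains more than m seeds (pigeonhole principle).
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        mismatches = hamming(read, reference[start:start + len(read)])
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "ACGTTGCAAGGCTTACGGATCCATGCAAGT"
index = build_kmer_index(reference, k=4)
exact_hits = map_read("GGCTTACGGATC", reference, index, k=4)    # read copied from the reference
inexact_hits = map_read("GGCTTACGCATC", reference, index, k=4)  # same read with one substitution
```

Both reads map to the same 0-based position; the second is found only because the verification step tolerates mismatches, while the seeds themselves are matched exactly.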
The FASTA sequence of an organism's reference genome can be downloaded from genome
databases such as NCBI Genome and the UCSC Genome Browser. The FASTA file can be
indexed with the “samtools faidx” command for fast random access to subsequences; in
addition, each aligner builds its own index of the reference before mapping.
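
The index written by “samtools faidx” is a small table with one row per sequence: name, length, byte offset of the first base, bases per line, and bytes per line. The sketch below rebuilds those five values in plain Python to show why they are enough for random access. It is an illustration only: it works on an in-memory byte string, assumes every line of a sequence except the last has the same width, and the helper names `build_fai` and `fetch` are hypothetical.

```python
def build_fai(fasta_bytes):
    """Return faidx-style rows: (name, length, offset, linebases, linewidth)."""
    records = []
    current = None
    pos = 0  # byte position of the start of the current line
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if current:
                records.append(tuple(current))
            name = line[1:].split()[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:
                current[3] = bases      # bases per line
                current[4] = len(line)  # bytes per line, newline included
            current[1] += bases
        pos += len(line)
    if current:
        records.append(tuple(current))
    return records


def fetch(fasta_bytes, record, start, end):
    """Extract the 0-based half-open interval [start, end) of one sequence."""
    _name, _length, offset, linebases, linewidth = record

    def byte_pos(i):
        # Whole lines skipped, plus the position within the current line.
        return offset + (i // linebases) * linewidth + (i % linebases)

    raw = fasta_bytes[byte_pos(start):byte_pos(end - 1) + 1]
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


fasta = b">chr1 test\nACGTACGT\nTTGG\n>chr2\nGGCCAA\n"
records = build_fai(fasta)
subsequence = fetch(fasta, records[0], 6, 10)
```

Because the byte position of any base follows from simple arithmetic, a tool can seek directly to a region without reading the whole genome file.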
The data structures commonly used for genome indexing include the Burrows-Wheeler
transform (BWT), the FM-index, suffix arrays, and hash tables, chosen for their memory
efficiency and their ability to store a whole genome sequence. A variety of aligners use
different indexing and lookup algorithms.
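
To make the BWT and FM-index idea concrete, the sketch below builds a tiny FM-index and finds exact matches by backward search. It is a teaching sketch under simplifying assumptions: the suffix array is built by naive sorting and the occurrence table is stored in full, whereas real aligners use compressed, sampled structures. The names `build_fm_index` and `backward_search` are hypothetical.

```python
def build_fm_index(reference):
    """Build a suffix array, first-column offsets, and an occurrence table."""
    text = reference + "$"  # "$" marks the end and sorts before A, C, G, T
    # Naive suffix array: suffix start positions in lexicographic order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to "$").
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, first, occ


def backward_search(pattern, sa, first, occ):
    """Return sorted 0-based positions where pattern occurs exactly."""
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, first, occ = build_fm_index("ACGTACGTTACG")
acg_positions = backward_search("ACG", sa, first, occ)
tta_positions = backward_search("TTA", sa, first, occ)
```

The search takes one step per pattern character regardless of genome size, which is the property that makes this family of indexes attractive for short-read mapping.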
We discussed only BWA, Bowtie2, and STAR, which serve as examples. Before using an
aligner, you need to know its memory requirements and whether it can handle short reads,
long reads, or both. If you have RNA-Seq reads, you also need to know whether the aligner
can detect splice junctions. Both BWA and Bowtie2 are general-purpose aligners that can be
used for all kinds of reads, and they run well on a desktop computer with 32GB of RAM or
more. STAR is better suited to RNA-Seq reads; it can also run on a desktop computer, but it
requires much more memory for both indexing and mapping.
Almost all aligners produce SAM/BAM files, which store read mapping information.
A SAM/BAM file consists of a header section and an alignment section. The alignment
section has eleven mandatory columns; each row holds the mapping information of one read.
The columns are read name, FLAG, reference sequence name (e.g., chromosome name or
accession), position in the reference sequence (coordinate), mapping quality, CIGAR string,
reference name of the mate, position of the mate, observed template length, read sequence,
and Phred base quality. The FLAG field stores standard bitwise codes that describe the
alignment (e.g., unmapped reads, duplicate reads, and chimeric alignments). The CIGAR
string describes how the read aligns to the reference through operations such as matches,
mismatches, insertions, and deletions.
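
The alignment columns, the FLAG bits, and the CIGAR operations can be unpacked with a few lines of plain Python. The sketch below follows the field names and bit values of the SAM specification; the function names, the labels given to the bits, and the example record are made up for illustration, and a real workflow would normally rely on Samtools or a dedicated library instead.

```python
import re

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse_strand",
    0x20: "mate_reverse_strand", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one alignment line into the mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]
    return record


def decode_flag(flag):
    """List the labels of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Turn a CIGAR string such as '5M1I4M' into (length, operation) pairs."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar_ops):
    """Number of reference bases covered: only M, D, N, = and X consume them."""
    return sum(n for n, op in cigar_ops if op in "MDN=X")


# Hypothetical record, for illustration only.
line = "\t".join(["read1", "99", "chr1", "100", "60", "5M1I4M2D10M",
                  "=", "300", "250", "ACGTACGTACGTACGTACGT", "I" * 20])
record = parse_sam_line(line)
flags = decode_flag(record["FLAG"])
ops = parse_cigar(record["CIGAR"])
span = reference_span(ops)
```

Note that the `M` operation means an aligned position and covers both matches and mismatches; the `=` and `X` operations distinguish the two when an aligner chooses to emit them.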
SAM/BAM files can be manipulated with programs such as Samtools and Picard.
The manipulations include format conversion, indexing, sorting, viewing, computing
statistics, and filtering.
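
Filtering and sorting are simple to picture on uncompressed SAM text. The sketch below drops unmapped, duplicate, and low-quality alignments and then sorts the rest by coordinate, which is roughly what filtering with “samtools view” followed by “samtools sort” achieves. It is a conceptual sketch only: the function name `filter_and_sort`, its default thresholds, and the example records are assumptions, and real BAM files should be processed with Samtools or Picard.

```python
def filter_and_sort(sam_lines, min_mapq=30, exclude_flags=0x4 | 0x400):
    """Keep header lines, filter alignments, and sort them by coordinate."""
    header, records = [], []
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)
            continue
        cols = line.split("\t")
        flag, mapq = int(cols[1]), int(cols[4])
        # Default exclusions: 0x4 = unmapped, 0x400 = duplicate.
        if flag & exclude_flags or mapq < min_mapq:
            continue
        records.append(line)

    # Reference sequences are ordered as declared in the @SQ header lines.
    ref_order = {}
    for h in header:
        if h.startswith("@SQ"):
            for field in h.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    ref_order[field[3:]] = len(ref_order)

    def sort_key(line):
        cols = line.split("\t")
        return (ref_order.get(cols[2], len(ref_order)), int(cols[3]))

    return header + sorted(records, key=sort_key)


def make_record(name, flag, ref, pos, mapq):
    """Build a minimal alignment line (hypothetical data)."""
    return "\t".join([name, str(flag), ref, str(pos), str(mapq),
                      "4M", "*", "0", "0", "ACGT", "IIII"])


sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:500",
    make_record("r3", 0, "chr2", 50, 60),
    make_record("r1", 0, "chr1", 200, 60),
    make_record("r2", 0, "chr1", 100, 10),    # low mapping quality
    make_record("r4", 4, "*", 0, 0),          # unmapped
    make_record("r5", 1024, "chr1", 150, 40), # duplicate
    make_record("r6", 16, "chr1", 20, 35),
]
result = filter_and_sort(sam_lines)
kept_names = [l.split("\t")[0] for l in result if not l.startswith("@")]
```

Coordinate-sorted output is what makes subsequent indexing of a BAM file possible, so sorting usually comes before any region-based viewing.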
SAM/BAM files are used in downstream analyses such as reference-guided
genome assembly, variant discovery, gene expression (RNA-Seq data analysis), epigenetics
(ChIP-Seq data analysis), and metagenomics, as we will discuss in the coming chapters.